Abstract 

Background: It is scientifically and ethically imperative that the results of statistical analysis of biomedical research 
data be computationally reproducible in the sense that the reported results can be easily recapitulated from the study 
data. Some statistical analyses are computationally a function of many data files, program files, and other details that 
are updated or corrected over time. In many applications, it is infeasible to manually maintain an accurate and 
complete record of all these details about a particular analysis. 

Results: Therefore, we developed the rctrack package that automatically collects and archives read-only copies of 
program files, data files, and other details needed to computationally reproduce an analysis. 

Conclusions: The rctrack package uses the trace function to temporarily embed detail collection procedures 
into functions that read files, write files, or generate random numbers so that no special modifications of the primary R 
program are necessary. At the conclusion of the analysis, rctrack uses these details to automatically generate a 
read-only archive of data files, program files, result files, and other details needed to recapitulate the analysis results. 
Information about this archive may be included as an appendix of a report generated by Sweave or knitr. Here, 
we describe the usage, implementation, and other features of the rctrack package. The rctrack package is freely 
available from http://www.stjuderesearch.org/site/depts/biostats/rctrack under the GPL license. 



Background 

The ability to reproduce research results is a cornerstone 
of the scientific method. Research results that are not 
reproducible are generally considered invalid by the scien- 
tific community. The result of a laboratory investigation 
must be independently reproduced by others to be con- 
sidered valid. Similarly, the validity of a clinical research 
finding is established via recapitulation in multiple dis- 
tinct clinical research cohorts. 

The exciting advances in data collection biotechnologies 
(microarrays, sequencing, proteomics, etc.) present 
the biomedical research community with special chal- 
lenges related to data interpretation and reproducibility of 
research results. The interpretation of massive data sets 
requires the use of sophisticated computational and sta- 
tistical analysis methods. The data sets themselves evolve 
as new technologies are introduced, additional tissue sam- 
ples become available, clinical follow-up is updated, etc. 
A specific analysis result may be a function of dozens 
of data files, program files, and other details. Without 
carefully tracking details about how those files are used 
to generate a specific result, it can be challenging if not 
impossible to describe how any particular research result 
can be recapitulated from the study data. This challenge 
represents a crisis for reproducibility as a pillar of the 
scientific method. Consequently, some journals are devel- 
oping policies to encourage computational reproducibility 
[1] and federal authority to audit the computational repro- 
ducibility of research results is being expanded [2]. Most 
recently, the National Cancer Institute issued guidelines 
that strongly encourage investigators to maintain high 
standards of computational reproducibility in their 'omics' 
studies [3,4]. 

Also, some recent mishaps in clinical cancer genomics 
research have shown that there is an ethical obligation 
to ensure that computational results are fully repro- 
ducible. The first indication of problems in a series of 
clinical trials was the inability to computationally repro- 
duce the results of the supporting scientific publications 
from the publicly available data sets [5,6]. Investigative 
follow-up of the analysis result discrepancies showed that 
the supporting science was fundamentally flawed [7]. 
Subsequently, the publications have been retracted and 
the clinical trials closed [8]. Thus, it is ethically impera- 
tive that data analysis results be computationally repro- 
ducible before they are used to direct therapy in a clinical 
trial [2,9]. 

Several computational tools have been developed to 
facilitate reproducible computing [10]. For example, 
Sweave [11] uses R (www.r-project.org) to compute 
statistical results and inserts them into a LaTeX 
(http://www.latex-project.org/) typesetting program that is 
subsequently converted into a PDF report. To enable this 
functionality, Sweave defines a special syntax to switch 
between R code and LaTeX code within the same file. 
In this way, the Sweave program file directly documents 
the top-level R code used to generate the corresponding 
PDF report. Similarly, the R packages knitr [12] and 
lazyWeave [13] enable one to computationally insert results 
determined with R into a wiki mark-up file and an 
open office document file, respectively. These literate 
programming [14] systems internally document exactly how 
the results in the report file were produced and thereby 
significantly enhance the scientific community's ability to 
perform reproducible computing. 

However, literate programming systems fail to address 
other challenges to reproducible computing. As previously 
mentioned, one analysis result may be a function of 
a very large number of input data files (such as microarray 
image files, sequence alignment files, etc.) and computer 
program files (such as the primary analysis program, some 
locally developed R routines, specific versions of R 
packages, etc.). These files must be collected and archived to 
ensure that the result may be reproduced at a later date. 
Manually reading through a program to identify and 
collect all those files is tedious and error-prone. Thus, 
there is a need to computationally collect and archive 
those files. 

Some tools have been developed to support repro- 
ducible computing by automatically archiving files. 
CDE software (www.pgbovine.net/cde.html) automati- 
cally generates an archive folder that replicates the entire 
directory structure for every file used to execute a spe- 
cific Linux command so that the identical command can 
be executed on another Linux machine without any con- 
flicts [15]. The technical completeness of the CDE archive 
is very impressive and CDE is very helpful for many Linux 
users who wish to ensure the computational reproducibil- 
ity of their work. 

Nevertheless, CDE has several limitations and drawbacks. 
Obviously, CDE is not helpful for people who use 
operating systems other than Linux and Unix. Also, the 
archive folder of CDE is excessively redundant for some 
settings. For example, the CDE archive includes a copy 
of many installation files for every software program (R, 
MATLAB, SAS, Java, etc.) used to perform an analysis. 
As CDE is used to document many analyses for multiple 
projects by a group of users that all have access to 
the same programs (such as a department of biostatistics 
or computational biology), the redundancy of such program 
files in CDE archives will begin to unnecessarily 
strain the storage resources of the computing infrastructure. 
This redundancy is amplified when the same large 
data files (such as genotype data files for genome-wide 
association studies) are present in multiple CDE archives. 
Additionally, collecting a large number of copies of 
proprietary software files in many CDE archives may 
inadvertently pose some legal problems. Furthermore, although 
the CDE archive is complete from a technical computing 
perspective, it still does not necessarily contain all the 
information necessary to exactly recapitulate the results 
of statistical analysis procedures such as bootstrapping, 
permutation, and simulation that rely on random number 
generation [16]. The initial seed and the random number 
generator must be retained for such analyses to be 
fully reproducible, and such information will not be stored 
in the CDE archive if the program files did not explicitly 
set the value for the initial seed. Finally, CDE does 
not automatically generate any file to help a reviewer 
understand the computational relationships among files, 
such as indicating that specific program files generated 
specific result files. Understanding such computational 
relationships is necessary to critically evaluate the 
appropriateness of the methods and the scientific validity 
of the results. 

We have developed the package rctrack (Additional 
file 1, also available from our website listed below) to auto- 
matically archive data files, code files, and other details 
to support reproducible computing for R software that 
is widely used for statistical analysis of biomedical data 
and is freely available for the Unix/Linux, Windows, and 
MacOS operating systems. The rctrack package may be 
used in conjunction with literate programming R packages 
(Sweave, knitr) and does not require the user to 
modify previously written R programs. It automatically 
documents details regarding random number generation 
and a sketch of the R function call stack at the time that 
data, code, or result files are accessed or generated. These 
details are saved in an archive that includes additional files 
that have been automatically archived according to default 
or user-specified parameters to control the contents and 
size of the archive. The files in this archive can be used 
to develop a custom R package or other set of files to 
help others recapitulate the analysis results at a later date. 
Finally, the rctrack package also provides functions to 
audit one archive and to compare two archives at a later 
date. 
Implementation 

Figure 1 illustrates the implementation of the rctrack 
package. The function begin.rctrack sets the initial 
value of the seed for random number generation, 
creates a dedicated memory environment rc.env to 
store details that are subsequently collected, and issues 
a series of trace statements that temporarily embed a 
set of detail-collection procedures into the definitions of 
R functions that access files for reading or writing, 
generate random numbers, initialize graphics devices, or 
issue system calls. The detail-collection procedures are 
then executed as part of those functions until the 
end.rctrack command is executed. Thus, the detail 
collection is performed automatically without any special 
modification of the R program. 
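
The tracing mechanism described above can be illustrated with a minimal sketch in base R. This is not the rctrack source code: the environment demo.env and the helper record.draws are hypothetical names chosen for illustration, and the sketch assumes it is run at the top level of an R session. It embeds a detail-collection call into rnorm with trace and removes it again with untrace.

```r
library(methods)  # trace() with a tracer relies on the methods package

# Dedicated environment that collects details (plays the role of rc.env).
demo.env <- new.env()
demo.env$draws <- numeric(0)

# Hypothetical detail-collection procedure: remember how many random
# numbers were requested in each traced call.
record.draws <- function(n) {
  demo.env$draws <- c(demo.env$draws, n)
  invisible(NULL)
}

# Temporarily embed the procedure at the start of rnorm.
# The tracer is evaluated inside rnorm, so it can see the argument n.
trace("rnorm", tracer = quote(record.draws(n)), print = FALSE)

x <- rnorm(3)   # recorded
y <- rnorm(5)   # recorded

# Restore the original definition of rnorm.
untrace("rnorm")

z <- rnorm(2)   # no longer recorded

print(demo.env$draws)
```

The program that calls rnorm needs no changes; only the function definition is modified for the duration of the tracing.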

After the program execution is complete, the 
end.rctrack statement creates an archive directory 
with a time-stamp suffix in its name. The end.rctrack 
function then accesses the information in rc.env 
to identify and copy files to this archive directory. 
The md5sum function is used to compute the MD5 checksum 
(a 32-character hexadecimal string) for each file and these 
results are recorded in rc.env. The end.rctrack function 
saves the R sessionInfo(), commandArgs(), and the contents 
of rc.env to an Rdata file in the archive directory. Next, 
end.rctrack sets the archive directory permissions 
to read-only to prevent the user from inadvertently 
modifying the archive later. Finally, end.rctrack 
removes the rc.env environment and uses untrace to 
restore the original definitions of all the tracked functions. 
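
The archiving steps can be sketched with base R and the tools package as follows. This is an illustrative approximation under stated assumptions, not the implementation of end.rctrack: the working directory, the input file, and the layout of the details list are invented for the example, and the read-only step uses Sys.chmod, whose effect is limited on Windows.

```r
library(tools)  # md5sum

# Hypothetical working directory with one input file.
work.dir <- tempfile("rc-demo-")
dir.create(work.dir)
input.file <- file.path(work.dir, "pt.data.txt")
writeLines(c("id\tx", "1\t0.5"), input.file)

# Create an archive directory with a time-stamp suffix.
stamp <- format(Sys.time(), "%Y-%m-%d-%H-%M-%S")
archive.dir <- file.path(work.dir, paste0("mydir-", stamp))
dir.create(archive.dir)

# Copy the file into the archive and record its checksum and size.
file.copy(input.file, archive.dir)
archived <- file.path(archive.dir, basename(input.file))
details <- list(
  files = data.frame(path = archived,
                     checksum = unname(md5sum(archived)),
                     size = file.size(archived),
                     stringsAsFactors = FALSE),
  session.info = sessionInfo(),
  command.args = commandArgs()
)

# Save the details to an Rdata file inside the archive.
details.file <- file.path(archive.dir, "rc.details.Rdata")
save(details, file = details.file)

# Mark the archived files as read-only.
Sys.chmod(c(archived, details.file), mode = "0444")

print(details$files[, c("checksum", "size")])
```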

The rctrack package also includes tools to 
audit and compare archives at a later time. The 
rc.check.archive function checks the existence, size, MD5 
checksum, and modification times of files that are listed in 
the Rdata file in the archive and reports any discrepancies 
in those parameters. The rc.compare.archives 
function compares the size, MD5 checksums, and 
modification times of files that are present in each of 
two archives generated by the rctrack package and also 
lists files that are present in one archive folder but not 
the other. Finally, the rc.check.zipfile function checks 
the existence, size, MD5 checksum, and modification times 
of files in an rctrack zip file archive and compares them to 
the values of those file attributes that were recorded at the 
time the zip file archive was generated. These auditing 
functions are documented in the package. 
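
The idea behind such an audit can be sketched in a few lines of base R. The functions make.manifest and check.manifest below are hypothetical helpers written for this illustration; they are not the rctrack auditing functions and they check only existence, size, and MD5 checksum.

```r
library(tools)  # md5sum

# Record name, size, and checksum of every file in a directory.
make.manifest <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  data.frame(name = basename(files),
             size = file.size(files),
             checksum = unname(md5sum(files)),
             stringsAsFactors = FALSE)
}

# Compare the current state of the directory with a recorded manifest.
check.manifest <- function(dir, manifest) {
  path <- file.path(dir, manifest$name)
  exists <- file.exists(path)
  current.sum <- unname(md5sum(path))   # NA for missing files
  data.frame(name = manifest$name,
             exists = exists,
             size.ok = exists & file.size(path) == manifest$size,
             checksum.ok = exists & current.sum == manifest$checksum,
             stringsAsFactors = FALSE)
}

# Demonstration with two small files, one of which is modified later.
demo.dir <- tempfile("archive-")
dir.create(demo.dir)
writeLines("alpha", file.path(demo.dir, "a.txt"))
writeLines("beta", file.path(demo.dir, "b.txt"))
manifest <- make.manifest(demo.dir)

writeLines("changed", file.path(demo.dir, "b.txt"))
report <- check.manifest(demo.dir, manifest)
print(report)
```

In the report, a.txt passes both checks while b.txt fails the size and checksum checks, which is the kind of discrepancy an audit is meant to surface.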

Main features 

Table 1 lists optional arguments of begin.rctrack 
that govern the detail collection and archiving. We 
anticipate that the rng.init, details.file, and 
archive.folder options will be of greatest interest 
to most users. The rng.init argument dictates the 


[Figure 1 flowchart, summarized as text] 

begin.rctrack(rng.init, details.file, archive.folder, options) 
  - Initialize rc.env, a dedicated memory space to store details. 
  - Set the random seed to rng.init and store its value in rc.env. 
  - Store function tracing details, details.file, and 
    archive.folder in rc.env. 
  - Embed with trace: 
      rctrack.file (store details about file name, file properties, 
        time, and call stack in rc.env) into the file I/O functions: 
        file, gzfile, xzfile, bzfile, url, pipe, fifo, pdf, 
        postscript, bmp, png, tiff 
      rctrack.random (store details about seed and call stack in 
        rc.env) into the RNG functions: set.seed, RNGkind, sample, 
        runif, rnorm, ... 
      rctrack.call (store details about time and call stack in 
        rc.env) into the system call functions: system, system2, 
        shell, shell.exec 

my.program.R 
  - File I/O    -> rctrack.file   -> details > rc.env 
  - RNG         -> rctrack.random -> details > rc.env 
  - System Call -> rctrack.call   -> details > rc.env 

end.rctrack() 
  - Create an archive folder named archive.folder-date-time. 
  - Use rc.env to identify and copy I/O files to the archive folder. 
  - Save rc.env, current workspace, commandArgs(), and 
    sessionInfo() to details.file in the archive folder. 
  - Set archive permissions to read only. 
  - Use untrace to restore original function definitions and 
    remove rc.env. 

Figure 1 Implementation of rctrack package. 
Table 1 Optional arguments of begin.rctrack 

Argument: rng.init 
  Value:  a list with components seed, kind, and normal.kind 
  Action: specifies how random number generation will be 
          initialized by set.seed 

Argument: details.file 
  Value:  name of Rdata file to store details 
  Action: end.rctrack will save details to this file 

Argument: archive.folder 
  Value:  character string name of the archive directory 
  Action: end.rctrack will create a directory with name defined 
          by appending a time stamp to archive.folder 

Argument: maxsize.archive 
  Value:  a numeric with file size in bytes 
  Action: end.rctrack will NOT copy INPUT files larger than this 
          size to the archive directory 

Argument: do.not.archive 
  Value:  a vector of character strings 
  Action: files with extensions matching these strings will NOT 
          be copied to the archive directory 

Argument: skip.file.calls 
  Value:  a vector of character strings with function names 
  Action: a list of high-level file access statements for which 
          details will NOT be collected 

Argument: skip.empty.description 
  Value:  a logical (TRUE/FALSE) that indicates whether to skip 
          detail collection for empty file descriptions 
  Action: if TRUE, then no details about file access events with 
          empty descriptions (no file name) will be collected 

Argument: rng.trace 
  Value:  a logical that indicates whether to collect details 
          about every random number generation event 
  Action: if TRUE, begin.rctrack will embed rctrack.random into 
          each of the functions listed in the rng.functions 
          argument so that a record of every call to those 
          functions is retained 

Argument: rng.functions 
  Value:  a vector of character strings with the names of 
          functions that generate random numbers 
  Action: begin.rctrack will embed rctrack.random into each of 
          these functions so that their use is documented in 
          rc.env and details.file 

Argument: print.trace 
  Value:  a logical that indicates whether to print messages 
          about detail tracking 
  Action: messages will be issued if TRUE 


random number generator to be used and the initial 
seed for random number generation. The default seed is 
123456789 and the default generator is R's default generator. 
The default of details.file is "rc.details.Rdata". 
The details will be saved to this file when end.rctrack 
is called. The default of archive.folder is NULL. 
With these defaults, an archive folder with a name of the 
form YYYY-MM-DD-HH-MM-SS (year-month-day-hour-minute-second) 
will be created in the present working 
directory at the time that end.rctrack is called. Otherwise, 
the name of the archive folder will have the form 
archive.folder-YYYY-MM-DD-HH-MM-SS. The 
details.file and copies of other files will be saved 
in this archive directory when end.rctrack is called. 
We provide defaults for rng.init, details.file, 
and archive.folder for user convenience but 
nevertheless recommend explicitly setting the values of 
these arguments in most applications. 
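
The reason for fixing both the seed and the generator can be seen in a small base R sketch. The list below mirrors the seed, kind, and normal.kind components of rng.init; the specific generator names are R's standard defaults and are written out here only for illustration, and draw.sample is a hypothetical helper.

```r
# Components analogous to rng.init (values chosen for illustration).
rng.init <- list(seed = 123456789,
                 kind = "Mersenne-Twister",
                 normal.kind = "Inversion")

# Initialize the generator explicitly, then draw random numbers.
draw.sample <- function(init, n = 5) {
  set.seed(init$seed, kind = init$kind, normal.kind = init$normal.kind)
  rnorm(n)
}

first <- draw.sample(rng.init)
second <- draw.sample(rng.init)

# The same seed and generator give exactly the same numbers.
print(identical(first, second))

# The generator in use can be recorded alongside the seed.
print(RNGkind()[1:2])
```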

There are other options that control the level and extent 
of detail collection and archiving. These options may be 
useful in certain settings. For example, the user can set 
maxsize.archive to control the maximum 
size of an INPUT file to be copied and frozen in the 
archive folder. This option allows the user to avoid filling 
their disk space with multiple copies of very large 
data files, such as genotype data files for a genome-wide 
association study. The do.not.archive option 
also controls the automatic file archiving. This option 
specifies a character vector of file name extensions. Files 
with these extensions will not be automatically archived. 
This may be useful to prevent repeated archiving of 
individual genotype or expression signal files (such as an 
Affymetrix CEL file). The skip.file.calls option 
is a vector of character strings with function names. No 
details about files accessed through calls to these 
functions will be collected nor will files accessed through calls 
to these functions be archived. This option prevents the 
archive from being cluttered with many low-level files 
accessed by loading R packages with the library or 
require statements. Details about the usage of R packages 
are captured separately through saving the results 
of the sessionInfo() command. Likewise, the option 
skip.empty.description prevents the collection 
of many unnecessary details about some technical 
low-level calls to the file function that do not actually 
read or write files. The rng.trace option indicates 
whether or not to retain a record of every random number 
generation event. The rng.trace option defaults 
to TRUE. However, for some applications in some computing 
environments, users may want to set rng.trace to 
FALSE so that calculations that make intensive 
use of random numbers can be completed more rapidly. 
The rng.functions option accepts a character vector 
of names of functions that generate random numbers. 
This is the list of random number generating functions 
that will be traced (if rng.trace is TRUE) from the 
begin.rctrack statement until the end.rctrack 
statement. Finally, the print.trace option controls 
whether messages regarding the collection of reproducible 
computing details are issued while the program 
is running. Defaults have been provided for all of these 
options. 
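
The effect of a size limit and an extension filter on input files can be sketched as follows. The function select.for.archive is a hypothetical helper written for this illustration, not part of rctrack; its arguments only imitate the roles of maxsize.archive and do.not.archive, and the case-insensitive matching of extensions is an assumption.

```r
library(tools)  # file_ext

# Return the input files that would be copied to the archive:
# files no larger than maxsize whose extension is not excluded.
select.for.archive <- function(files, maxsize = 1e7, skip.ext = NULL) {
  size <- file.size(files)
  ext <- tolower(file_ext(files))
  keep <- !is.na(size) & size <= maxsize & !(ext %in% tolower(skip.ext))
  files[keep]
}

# Demonstration with three temporary files.
demo.dir <- tempfile("inputs-")
dir.create(demo.dir)
small.file <- file.path(demo.dir, "small.txt")
big.file <- file.path(demo.dir, "big.txt")
cel.file <- file.path(demo.dir, "chip.CEL")
writeLines("a", small.file)
writeLines(strrep("x", 2000), big.file)   # about 2 KB
writeLines("signal", cel.file)

selected <- select.for.archive(c(small.file, big.file, cel.file),
                               maxsize = 1000, skip.ext = "cel")
print(basename(selected))
```

Only small.txt is selected: big.txt exceeds the size limit and chip.CEL has an excluded extension.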

Sometimes, R may be used to issue system commands 
to perform some calculation with other software (MATLAB, 
PLINK, etc.). The rctrack package will note that such 
an action has occurred but is currently unable to collect 
details about the actions of the other software. In some 
cases, it may be possible to write a wrapper function in R 
that adds information about what the other software 
does (such as the input files and output files of the 
external program) to the details collected by rctrack. 
In other cases, it can be useful to have a 
record indicating that these actions have occurred so that 
the user can manually identify and archive the necessary 
details. 
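
One possible shape for such a wrapper is sketched below. Everything in it is hypothetical: run.external, file.details, and external.log are names invented for this illustration, and the sketch keeps its own log rather than writing into the details collected by rctrack. A stand-in function replaces the external program so that the example runs without any other software installed; in practice the runner argument would default to system2.

```r
library(tools)  # md5sum

# Separate log for external commands (hypothetical; not rc.env).
external.log <- new.env()
external.log$records <- list()

# Path and checksum for a set of files.
file.details <- function(paths) {
  data.frame(path = paths,
             checksum = unname(md5sum(paths)),
             stringsAsFactors = FALSE)
}

# Run an external command and record its declared inputs and outputs.
run.external <- function(command, args = character(0),
                         inputs = character(0), outputs = character(0),
                         runner = system2) {
  input.details <- file.details(inputs)   # state before the run
  status <- runner(command, args)
  record <- list(command = command, args = args, time = Sys.time(),
                 status = status, inputs = input.details,
                 outputs = file.details(outputs))
  external.log$records[[length(external.log$records) + 1L]] <- record
  invisible(status)
}

# Stand-in for an external program: copies its first argument
# to its second argument and reports success.
fake.tool <- function(command, args) {
  file.copy(args[1], args[2], overwrite = TRUE)
  0L
}

in.file <- tempfile(fileext = ".txt")
out.file <- tempfile(fileext = ".txt")
writeLines("genotype data", in.file)

run.external("fake-tool", c(in.file, out.file),
             inputs = in.file, outputs = out.file, runner = fake.tool)

print(external.log$records[[1]]$outputs$checksum)
```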

Usage 

The rctrack package is extremely simple to use; it only 
requires loading the library, issuing begin.rctrack 
to start collecting details, and issuing end.rctrack to 
save the details to an Rdata file, create a read-only archive 
of the files used in the data analysis, and discontinue 
detail collection. Figure 2 shows how to use the rctrack 
package to track the computing details of an R program. 

In this example, R will collect details about every 
instance that my.program.R and all of its subprograms 
read or write a file, generate random numbers, or 
issue a system call. All these details, the starting 
value of the seed for random number generation, and 
the session information will be saved in a read-only 
file mydir-date-time/rc.details.Rdata, where 
date-time is a date-time stamp appended to the name 


library(rctrack) 
begin.rctrack(details.file="rc.details.Rdata", 
              archive.folder="mydir") 
source("my.program.R") 
end.rctrack() 

Figure 2 Use of rctrack package. 



of the archive directory mydir. Additionally, read-only 
copies of all input files below a specified size and all 
output files (regardless of size) will be placed in the 
archive directory mydir-date-time. (The permissions 
are automatically set to read-only to prevent the user from 
inadvertently deleting or modifying the archived files; this 
does not prevent the deliberate modification of archived 
files because the read-only permissions can be modified 
later.) Thus, mydir-date-time provides a complete 
record of the calculations performed by my.program.R 
at the date and time indicated in the directory name. 
The date-time stamp serves to identify specific versions 
of the analysis. Most importantly, no modifications need 
to be made to my.program.R for these detail tracking 
and archiving operations to occur. Thus, rctrack 
provides a comprehensive solution for collecting and 
archiving reproducible computing details while minimizing user 
burden. 

Results 

Figure 3 provides an illustrative example of the contents 
of my.program.R of Figure 2. The example includes 
random number generation (rnorm), writing a tabular 
output file (write.table), reading a tabular input file 
(read.table), and generating a graphical output file 
(jpeg). 



# Generate example data 
n=50                  # No. of Subjects 
pt.id=1:n             # Subject identifier 
x=rnorm(n)            # Random "X" variable 
y=rnorm(n)            # Random "Y" variable 

pt.data0=cbind.data.frame(id=pt.id,x=x,y=y) 

# Write data to a file 
pt.file="pt.data.txt" 
write.table(pt.data0,pt.file,sep="\t", 
            row.names=F,col.names=T) 

# Import the data 
pt.data1=read.table(pt.file,sep="\t", 
                    header=T,as.is=T) 

# Perform a Pearson correlation test 
cor.res=cor.test(x,y,data=pt.data1) 

# Generate a figure 
fig.file="scatterplot.jpg" 
jpeg(fig.file) 
plot(pt.data1$x, 
     pt.data1$y, 
     xlab="x",ylab="y") 
dev.off() 

Figure 3 Contents of my.program.R from Figure 2. 
Table 2 shows an abbreviated version of the contents 
of the rc.file.details object that was 
stored in the rc.env environment and then saved to 
rc.details.Rdata. The rc.file.details object 
is a data.frame that contains a sketch of the call 
stack at the time a file was accessed (in the columns 
top.call, mid.call, and bot.call), the time the 
traced file input/output command was called, the file 
description (usually a file name), and how the file was 
opened (read or write). The full version contains the entire 
function calls and the full file path in the description 
(Figure 4). In this example, the file details indicate that the 
files were accessed during a call to Sweave (to generate 
this paper), which subsequently called read.table, 
write.table, or jpeg. 
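
A sketch of a call stack of this kind can be obtained in base R with sys.calls. The helper below is hypothetical and is not the rctrack implementation. In particular, taking mid.call to be the caller of the bottom call is an assumption that happens to agree with the rows of Table 2 but is not confirmed by the text; the three demonstration functions are invented stand-ins.

```r
# Summarize the call stack as the names of the top, middle, and
# bottom calls at the moment this helper is invoked.
sketch.call.stack <- function() {
  calls <- as.list(sys.calls())
  calls <- calls[-length(calls)]   # drop the call to this helper
  n <- length(calls)
  if (n == 0) {
    return(c(top.call = NA_character_, mid.call = NA_character_,
             bot.call = NA_character_))
  }
  label <- function(cl) deparse(cl[[1]])[1]   # function name only
  c(top.call = label(calls[[1]]),
    mid.call = label(calls[[max(n - 1, 1)]]), # assumed: caller of bottom
    bot.call = label(calls[[n]]))
}

# Demonstration: read.data stands in for a traced file-access function.
read.data <- function() sketch.call.stack()
load.study <- function() read.data()
run.analysis <- function() load.study()

stack <- run.analysis()
print(stack)
```

When the sketch is run at the top level of a script, the bottom call is read.data and the middle call is load.study; the top call depends on how the script itself was launched, just as Sweave appears as the top call in Table 2.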

Also, Table 3 shows an abbreviated version of the contents 
of the random.details object that was stored 
in the rc.env environment and then saved to the file 
rc.details.Rdata. Again, the random.details 
object is a data.frame that contains a sketch of the 
call stack at the time a random number generation 
operation occurred. The random.details object also 
contains the starting value of the random seed and 
indicates what random number generator was used. The top 
row indicates that the seed was set by begin.rctrack 
and subsequent rows are records of how rnorm was used 
to generate the data. The full version contains complete 
function calls in the sketch of the call stack. 

Finally, the file rc.details.Rdata contains additional 
information that may be useful for documenting 
and achieving reproducible computing. In addition to all 
the information collected in the special rc.env memory 
environment, it contains the R sessionInfo(), 
Sys.info(), and memory.size() at the time 
end.rctrack was called. Figure 4 shows the complete 
contents of rc.details.Rdata for the example 
shown by Figures 2 and 3. Additional file 2 is the archive 
generated by this example; Additional file 3 provides the 
program files and gives detailed instructions for how to 
execute this example. 

We also ran the simple example my.program.R in 
Figure 3 under CDE. The CDE archive was much larger 
than the rctrack archive. The rctrack archive contained 
8 files in one folder with a total size of 96 KB. The CDE 
archive contained 249 files in 207 folders with a total 
size of 173 MB, which is over 1,000 times larger than 
the rctrack archive. The bulk of the CDE archive consisted 
of environment files, shared libraries, R installation 
files, and R package files that are redundant for R users 
who already have those files or can easily obtain them. The 
exhaustive CDE archive does provide the advantage that 
others who do not have these R libraries can issue a single 
Linux command to recapitulate the calculation within 
random number generation variability. The rctrack package 
provides sufficient information for others who already 
have R and the necessary libraries to recapitulate the 
calculation with some modest effort (such as installing 
the packages and revising the file paths in the code file). 
The rctrack package also documents the initial random 
seed and random number generator so that the 
calculation can be exactly recapitulated. Furthermore, any 
Linux users who wish to have all the information captured 
by both CDE and rctrack may run rctrack under 
CDE. 

Discussion 

Reproducible computing is an essential component of 
biomedical research in the 'big data' era. Literate 
programming systems such as Sweave and knitr internally 
document specific calculations at the top-level program 
code level. However, to achieve complete and permanent 
reproducibility, one must identify and archive 
all data files and other computational components of an 
analysis. The rctrack package provides a simple and 
effective computational approach to collect and archive 
the plethora of low-level details needed to achieve and 
document complete and permanent reproducibility of a 
statistical analysis. In particular, the analysis archive can 
be provided as Additional files of a published report to 
provide complete documentation for researchers, reviewers, 
supervisors, institutions, or regulatory agencies who 
are interested in recapitulating the analysis results or 
using the analysis methods in their own studies. A custom 
R package is an excellent format in which to distribute 
these supplementary materials; the rctrack package 
can help the user to create such a package by assembling 
all the necessary elements into a single archive file. This 
should greatly enhance the rigor and transparency of 
scientific discourse and expedite the evaluation and adoption 
of robust data analysis methodologies. 



Table 2 Abbreviated contents of rc.file.details 

   top.call  mid.call     bot.call  call.time                 description      open 
1  Sweave    source       file      Thu Mar 27 14:52:14 2014  my.program.R     r 
2  Sweave    write.table  file      Thu Mar 27 14:52:14 2014  pt.data.txt      w 
3  Sweave    read.table   file      Thu Mar 27 14:52:14 2014  pt.data.txt      rt 
4  Sweave    eval         jpeg      Thu Mar 27 14:52:14 2014  scatterplot.jpg  w 
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Smaxsize. archive 
[1] le+07 

$rc. calls 
NULL 



Src. message. list 

[1] "utils: :Sweave(\"rctrack-BMC-revision-20140327.Rnw\" , encoding = \"IS08859-1\") issued at Thu Mar 27 14:52:14 2014" 

[2] "utils: :SweaveC\"rctrack-BMC-revision-20140327.Rnw\" , encoding = \"IS08859-1\") issued at Thu Mar 27 14:52:14 2014" 

Srunni ng. time 

user system elapsed 
0.82 0.07 1.01 



: Sweave (" ret rack- BMC- revi si on-2014032 7 . Rnw" , encodi ng = 

:Sweave("rctrack-BMC-revision-20140327.Rnw", encoding = 

: Sweave("rctrack-BMC-revisi on-2014032 7 . Rnw" , encodi ng = 

: Sweave ("re track- BMC- revi si on-2014032 7 . Rnw" , encodi ng = 

bot.call call . 



top. cal 1 
"IS08859-1") 
"IS08859-1") 
"IS08859-1") 
"IS08859-1") 



mi d. cal 1 

sou rce("V:/StatTools/ ret rack/manuscript/revi si on/my .program. R") 
:able(pt.dataO, pt.file, sep = "\\t" , row. names = F, col. names = T) 
read. tableCpt . fi le, sep = "\\t", header = T, as. is = ~0 



1 2014 

2 2014 

3 2014 

4 2014 



e(filename, "r", encoding = encoding) Thu Mar 27 14:52:14 2014 

fileCfile, ifelseCappend , "a", "»")) Thu Mar 27 14:52:14 2014 

fileCfile, "rt") Thu Mar 27 14:52:14 2014 

j peg (fig- file) Thu Mar 27 14:52:14 2014 ' 
ctime atime exe exist 



Figure 4 Appendix produced by rctrack. (The appendix lists, for each file read or written, its original and archived path, open mode, size, time stamps, and checksum; the call and seed details for random number generation; the rctrack settings, including the archive folder and the list of traced functions; and the R session information.)



Table 3 Abbreviated contents of rc.random.details

   top.call       mid.call       bot.call       call.date        seed       kind  normal.kind
1  begin.rctrack  begin.rctrack  begin.rctrack  Mar 27 14:52:14  123456789  NULL  NULL
2  Sweave         eval           rnorm          Mar 27 14:52:14
3  Sweave         eval           rnorm          Mar 27 14:52:14



Liu and Pounds BMC Bioinformatics 2014, 15:138
http://www.biomedcentral.com/1471-2105/15/138

Unlike CDE, the rctrack package captures the initial
seed for random number generation. Many statistical
analysis procedures, such as permutation, bootstrap, and
Markov chain Monte Carlo simulation, rely explicitly on
random number generation. The rctrack package captures
sufficient information to recapitulate random number
generation with the stats package on a single processor
and to recapitulate parallel random number generation as
implemented in the doRNG package. The rctrack package
does not capture every seed used in the parallel
implementation in doRNG, but it does capture the initial
seed, which is sufficient to recapitulate the random
number series. The rctrack package only monitors random
number generation; it does not attempt to modify or
otherwise control it. Thus, rctrack will not introduce
any problems for random number generation that is
properly implemented in a serial or parallel manner.
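
The idea of monitoring without modifying can be illustrated with a minimal sketch in base R. It is not the rctrack implementation; it only assumes that a passive tracer is attached to rnorm with the trace function. The option name demo.rnorm.calls is hypothetical and exists only for this illustration, and the seed is the one shown in Table 3.

```r
# Reference draw without any monitoring.
set.seed(123456789)
x_plain <- rnorm(5)

# Attach a passive tracer that only counts calls to rnorm().
# "demo.rnorm.calls" is a made-up option name used for this sketch.
options(demo.rnorm.calls = 0L)
trace("rnorm",
      tracer = quote(options(demo.rnorm.calls =
                               getOption("demo.rnorm.calls") + 1L)),
      print = FALSE)

# Same seed, same call, now with the tracer in place.
set.seed(123456789)
x_traced <- rnorm(5)

# Remove the tracer so later code sees the original function.
untrace("rnorm")

# The tracer recorded the call but did not disturb the random series.
stopifnot(identical(x_plain, x_traced))
n_calls <- getOption("demo.rnorm.calls")
```

Because the tracer only writes to an option and never touches the generator state, resetting the recorded initial seed is enough to regenerate the same series.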

The current version of the rctrack package has some
limitations that we wish to address in future versions. As
previously mentioned, the user must modify the file paths
in the archive to recapitulate the analysis. The rctrack
package captures the complete original absolute file path
and the archived file path for every file. We plan to use
this information in a new version of the software that
relieves the user of the burden of manually modifying the
file paths by computationally redirecting them. The current
version does not collect details or archive files for
calculations that were performed by external software via a
system call from R; instead, it simply notes that system
calls to external software were performed. We plan to
explore and develop ways to use the Linux ptrace system
call and analogous features in Windows to capture relevant
information from external software calls made via the R
system function. Also, the current version of rctrack does
not combine information from multiple R sessions that run
independently in parallel. We are currently developing
routines that will combine information from such an array
of R sessions.
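
One possible form of the planned path redirection is sketched below. It is only an illustration under stated assumptions, not a feature of the current package: the function redirect_path is hypothetical, the example paths are invented, and the column names orig.full.path and archive.file.path are borrowed from the appendix in Figure 4.

```r
# Hypothetical lookup table pairing each original path with its archived
# copy. The column names mirror the Figure 4 appendix; the paths are
# invented for this illustration.
path_map <- data.frame(
  orig.full.path    = c("V:/project/pt.data.txt",
                        "V:/project/my.program.R"),
  archive.file.path = c("V:/project/archive/pt.data.txt",
                        "V:/project/archive/my.program.R"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: return the archived path when the original path
# was recorded, and leave any other path unchanged.
redirect_path <- function(path, map) {
  hit <- match(path, map$orig.full.path)
  ifelse(is.na(hit), path, map$archive.file.path[hit])
}

archived  <- redirect_path("V:/project/pt.data.txt", path_map)
untouched <- redirect_path("C:/elsewhere/other.csv", path_map)
```

A lookup of this kind could be applied to the file argument of each traced read or write call, so that the archived program runs against the archived files without manual editing.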

Conclusion 

The rctrack package was developed to minimize user
burden so that it can be immediately useful in practice.
The rctrack package only requires statements to initiate
and terminate collection of details for reproducible
computing; no additional programming effort is required.
Furthermore, the rctrack package may be used in
conjunction with the literate programming packages Sweave,
knitR, and lazyweave so that the reproducible computing
details may be incorporated as an appendix of the
reports generated by those tools. Finally, the open source
rctrack package provides a roadmap for advanced
users to expand the capabilities to collect and archive
other details about their specific calculations. Therefore,
rctrack will be very useful in the modern era of 'big
data'.

Availability and requirements 

Project name: rctrack

Project homepage: http://www.stjuderesearch.org/site/depts/biostats/rctrack

Operating system: Platform independent

Other requirements: R 2.15.0 or higher

License: GPL

Additional files 



Additional file 1: rctrack package. This is the rctrack package.
Installation instructions are included in the file example and instruction jar.

Additional file 2: Archive folder for the results. Archive folder 
generated by the rctrack package for the program my.program.R shown in 
Figure 3. 

Additional file 3: Example and instruction. This file includes the 
instructions for installing the rctrack package and how to perform the 
example shown in Figure 2. 
